Smart Paper Notes

Training Graph Neural Networks (GNNs) at scale with minibatch sampling, on graphs containing billions of vertices and edges, poses a key challenge: strong-scaling the graph and the training examples leaves less computation per node and more communication volume, which can degrade performance. DistGNN-MB addresses this challenge with a novel Historical Embedding Cache combined with compute-communication overlap. On a 32-node (64-socket) cluster of $3^{rd}$ generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely used DistDGL. As the number of compute nodes scales from 2 to 32, DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively.
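The scaling figures quoted in this abstract can be turned into a strong-scaling efficiency with simple arithmetic. The sketch below is illustrative only: the helper name `scaling_efficiency` is an assumption, and the only inputs are the speedups (10x and 17.2x) and node counts (2 and 32) stated in the abstract.

```python
# Strong-scaling efficiency implied by the speedups quoted in the abstract.
# Efficiency = observed speedup / ideal speedup, where the ideal speedup is
# the factor by which the node count grows (32 / 2 = 16).

def scaling_efficiency(speedup: float, nodes_from: int, nodes_to: int) -> float:
    """Return observed speedup as a fraction of the ideal linear speedup."""
    ideal = nodes_to / nodes_from
    return speedup / ideal

# Speedups taken from the abstract (2 -> 32 compute nodes).
reported = {"GraphSAGE": 10.0, "GAT": 17.2}

efficiency = {
    model: scaling_efficiency(speedup, nodes_from=2, nodes_to=32)
    for model, speedup in reported.items()
}

for model, eff in efficiency.items():
    print(f"{model}: {eff:.1%} of ideal 16x scaling")
```

Under this definition GraphSAGE reaches 62.5% of ideal scaling, while the GAT figure comes out above 100%, that is, superlinear. The abstract does not state the cause.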

Translated by Google Translate

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads

Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Deepti Aggarwal, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Narendra Chaudhary, Abhisek Kundu

Category: Artificial Intelligence

2021-04-12

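Over the past decade, novel Deep Learning (DL) algorithms, workloads, and hardware have been developed to tackle a wide range of problems. Despite the advances in the workload and hardware ecosystems, the programming methodology of DL systems has stagnated. DL workloads either rely on highly optimized, yet platform-specific and inflexible kernels from DL libraries or, in the case of novel operators, on reference implementations built from DL-framework primitives, which deliver underwhelming performance. This work introduces Tensor Processing Primitives (TPP), a programming abstraction aimed at efficient, portable implementation of DL workloads with high productivity. TPPs define a compact yet versatile set of 2D-tensor operators (or a virtual Tensor ISA), which can then be used as building blocks to construct complex operators on high-dimensional tensors. The TPP specification is platform-agnostic, so code expressed via TPPs is portable, whereas the TPP implementation is highly optimized and platform-specific. We demonstrate the efficacy and viability of our approach using standalone kernels and end-to-end DL and HPC workloads expressed entirely via TPPs, which outperform state-of-the-art implementations on multiple platforms.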
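The idea of composing operators on high-dimensional tensors from a small set of 2D building blocks can be sketched as follows. This is a minimal pure-Python illustration, not the TPP API: the names `gemm_tile`, `relu_tile`, and `blocked_linear_relu`, and the blocked layout of the inputs, are all hypothetical.

```python
from typing import List

Tile = List[List[float]]  # a small 2D tensor block

def zero_tile(rows: int, cols: int) -> Tile:
    """Allocate a rows x cols tile filled with zeros."""
    return [[0.0] * cols for _ in range(rows)]

def gemm_tile(a: Tile, b: Tile, c: Tile) -> None:
    """Hypothetical 2D primitive: c += a @ b on small tiles."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))

def relu_tile(c: Tile) -> None:
    """Hypothetical 2D primitive: in-place elementwise max(x, 0)."""
    for row in c:
        for j, x in enumerate(row):
            row[j] = max(x, 0.0)

def blocked_linear_relu(a_blocks: List[List[Tile]],
                        b_blocks: List[List[Tile]]) -> List[List[Tile]]:
    """Composite operator on 4D blocked tensors: relu(A @ B).

    a_blocks[mb][kb] and b_blocks[kb][nb] are 2D tiles. The loops over block
    indices are ordinary code; all arithmetic goes through the 2D primitives.
    """
    out = []
    for mb in range(len(a_blocks)):
        out_row = []
        for nb in range(len(b_blocks[0])):
            c = zero_tile(len(a_blocks[mb][0]), len(b_blocks[0][nb][0]))
            for kb in range(len(b_blocks)):
                gemm_tile(a_blocks[mb][kb], b_blocks[kb][nb], c)
            relu_tile(c)
            out_row.append(c)
        out.append(out_row)
    return out

# A is 1x4 split into two 1x2 tiles; B is 4x2 split into 2x2 blocks of 2x1 tiles.
a_blocks = [[[[1.0, 2.0]], [[3.0, 4.0]]]]
b_blocks = [
    [[[1.0], [0.0]], [[-1.0], [0.0]]],
    [[[-1.0], [1.0]], [[0.0], [-1.0]]],
]
result = blocked_linear_relu(a_blocks, b_blocks)
print(result)
```

In this sketch only the two tile-level functions would need a platform-specific, optimized implementation. The outer loop nest stays unchanged, which is the portability argument the abstract makes.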

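Deep learning systems need large amounts of data for training. Datasets for training face verification systems are difficult to obtain and prone to privacy issues. Synthetic data generated by generative models such as GANs can be a good alternative. However, we show that data generated by GANs are prone to bias and fairness issues. In particular, GANs trained on the FFHQ dataset show a bias towards generating white faces in the 20-29 age group. We also demonstrate that synthetic faces cause disparate impact, specifically for the race attribute, when used to fine-tune face verification systems. This is measured using the $DoB_{fv}$ metric, which is defined as the standard deviation of GAR@FAR for face verification.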
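The $DoB_{fv}$ definition can be illustrated with a short calculation. This sketch assumes that the standard deviation is taken over per-group GAR values measured at one fixed FAR, and that it is the population standard deviation; the abstract specifies neither. The group names and GAR values are made-up placeholders, not results from the paper.

```python
from statistics import pstdev
from typing import Dict

def dob_fv(gar_at_far_by_group: Dict[str, float]) -> float:
    """Standard deviation of GAR@FAR across demographic groups.

    Assumption: population standard deviation over per-group GAR values,
    all measured at the same FAR operating point.
    """
    return pstdev(list(gar_at_far_by_group.values()))

# Hypothetical per-group GAR values at a fixed FAR (illustrative only).
example_gar = {"group_a": 0.90, "group_b": 0.80, "group_c": 0.70}

score = dob_fv(example_gar)
print(f"DoB_fv = {score:.4f}")
```

A value of zero would mean every group has the same GAR at the chosen FAR; larger values indicate a wider spread in verification accuracy across groups.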
